Incorporating Dialectal Variability for Socially Equitable Language Identification
Abstract
Language identification (LID) is a critical first step for processing multilingual text. Yet most LID systems are not designed to handle the linguistic diversity of global platforms like Twitter, where local dialects and rampant code-switching lead language classifiers to systematically miss minority dialect speakers and multilingual speakers. We propose a new dataset and a character-based sequence-to-sequence model for LID designed to support dialectal and multilingual language varieties. Our model achieves state-of-the-art performance on multiple LID benchmarks. Furthermore, in a case study using Twitter for health tracking, our method substantially increases the availability of texts written by underrepresented populations, enabling the development of “socially inclusive” NLP tools.
Similar resources
From perceptual designs to linguistic typology and automatic language identification : overview and perspectives
This paper deals with the overview of the methods in perceptual language identification and the suggestion of a new approach based on a two-step methodology integrating to perception “genetic” considerations and resulting into the modeling of perceptually identified discriminative cues. The first study reported here concerns experimental designs for perceptual and automatic identification of th...
Demographic Dialectal Variation in Social Media: A Case Study of African-American English
Though dialectal language is increasingly abundant on social media, few resources exist for developing NLP tools to handle such language. We conduct a case study of dialectal language in online conversational text by investigating African-American English (AAE) on Twitter. We propose a distantly supervised model to identify AAE-like language from demographics associated with geo-located message...
Spanish dialects: phonetic transcription
It is well known that canonical Spanish, the dialectal variant ‘central’ of Spain, so called Castilian, can be transcribed by rules. This paper deals with the automatic grapheme to phoneme transcription rules in several Spanish dialects from Latin America. Spanish is a language spoken by more than 300 million people, has an important geographical dispersion compared among other languages and ha...
Identification and handling of dialectal variation with a single grammar
We present a study on approaches to handle variation in a deep natural language processing formalism. It allows a grammar to be parameterized as to what language variants it accepts, but also to detect such variants. In this respect, we compare it to standard language identification methods, employed here to detect variation in the same language.
Setting parametric limits on dialectal variation in Spanish*
The present investigation departs from the perspective that dialects of languages may exemplify typological distinctions, and as such, may be defined within parametric limits. More specifically, this synchronic study focuses on the inter- and intra-dialectal variation attested within the Spanish language, heretofore exempted from the scrutiny that has characterized syntactic studies of other Roma...